chain-of-thought prompting
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. This level of transparency into LLMs' predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs--e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always (A)--which models systematically fail to mention in their explanations.
Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology
Anglin, Kylie L., Milan, Stephanie, Hernandez, Brittney, Ventura, Claudia
Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies --codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting - with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
- North America > United States > Connecticut (0.04)
- Asia > India (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Can we use LLMs to bootstrap reinforcement learning? -- A case study in digital health behavior change
Albers, Nele, de Groot, Esra Cemre Su, Keijsers, Loes, Hillegers, Manon H., Krahmer, Emiel
Personalizing digital applications for health behavior change is a promising route to making them more engaging and effective. This especially holds for approaches that adapt to users and their specific states (e.g., motivation, knowledge, wants) over time. However, developing such approaches requires making many design choices, whose effectiveness is difficult to predict from literature and costly to evaluate in practice. In this work, we explore whether large language models (LLMs) can be used out-of-the-box to generate samples of user interactions that provide useful information for training reinforcement learning models for digital behavior change settings. Using real user data from four large behavior change studies as comparison, we show that LLM-generated samples can be useful in the absence of real data. Comparisons to the samples provided by human raters further show that LLM-generated samples reach the performance of human raters. Additional analyses of different prompting strategies including shorter and longer prompt variants, chain-of-thought prompting, and few-shot prompting show that the relative effectiveness of different strategies depends on both the study and the LLM with also relatively large differences between prompt paraphrases alone. We provide recommendations for how LLM-generated samples can be useful in practice.
- Europe > Austria > Vienna (0.14)
- Europe > Netherlands > South Holland > Rotterdam (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (11 more...)
RubiSCoT: A Framework for AI-Supported Academic Assessment
Fröhlich, Thorsten, Schlippe, Tim
The evaluation of academic theses is a cornerstone of higher education, ensuring rigor and integrity. Traditional methods, though effective, are time-consuming and subject to evaluator variability. This paper presents RubiSCoT, an AI-supported framework designed to enhance thesis evaluation from proposal to final submission. Using advanced natural language processing techniques, including large language models, retrieval-augmented generation, and structured chain-of-thought prompting, RubiSCoT offers a consistent, scalable solution. The framework includes preliminary assessments, multidimensional assessments, content extraction, rubric-based scoring, and detailed reporting. We present the design and implementation of RubiSCoT, discussing its potential to optimize academic assessment processes through consistent, scalable, and transparent evaluation.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Europe > Germany > North Rhine-Westphalia > Cologne Region > Cologne (0.04)
- Instructional Material (1.00)
- Overview (0.95)
- Research Report > Experimental Study (0.69)
- Education > Educational Technology > Educational Software > Computer-Aided Assessment (1.00)
- Education > Educational Setting (1.00)
- Education > Assessment & Standards (1.00)
- North America > United States > Ohio (0.04)
- Oceania > New Zealand (0.04)
- Oceania > Cook Islands (0.04)
- (5 more...)
- North America > United States > Ohio (0.04)
- Oceania > New Zealand (0.04)
- Oceania > Cook Islands (0.04)
- (5 more...)
- North America > United States > Ohio (0.28)
- Europe > Germany (0.27)
- North America > United States > Texas > Travis County > Austin (0.27)
- (12 more...)
- Research Report > New Finding (0.92)
- Research Report > Experimental Study (0.67)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- (3 more...)
- North America > United States > Pennsylvania (0.04)
- Asia > Vietnam (0.04)
- North America > Mexico (0.04)
- Media (0.93)
- Consumer Products & Services (0.67)
- Leisure & Entertainment > Sports > Basketball (0.46)
- Leisure & Entertainment > Sports > Football (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Jordan (0.04)
More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning
Shafiei, Mohammadamin, Saffari, Hamidreza, Moosavi, Nafise Sadat
Large language models (LLMs) are known to be sensitive to input phrasing, but the mechanisms by which semantic cues shape reasoning remain poorly understood. We investigate this phenomenon in the context of comparative math problems with objective ground truth, revealing a consistent and directional framing bias: logically equivalent questions containing the words ``more'', ``less'', or ``equal'' systematically steer predictions in the direction of the framing term. To study this effect, we introduce MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. We find that model errors frequently reflect linguistic steering, systematic shifts toward the comparative term present in the prompt. Chain-of-thought prompting reduces these biases, but its effectiveness varies: free-form reasoning is more robust, while structured formats may preserve or reintroduce directional drift. Finally, we show that including demographic identity terms (e.g., ``a woman'', ``a Black person'') in input scenarios amplifies directional drift, despite identical underlying quantities, highlighting the interplay between semantic framing and social referents. These findings expose critical blind spots in standard evaluation and motivate framing-aware benchmarks for diagnosing reasoning robustness and fairness in LLMs.
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (7 more...)
- Health & Medicine > Consumer Health (0.67)
- Education (0.67)